5.1 - Scraping data from the web

The internet is a huge repository of data that has revolutionized our access to information across many domains. However, data online is usually formatted for visual consumption and does not follow a standard structure. Some service-oriented websites, such as news organizations and social media platforms, provide structured ways to access their data in the form of an Application Programming Interface (API). APIs give developers a set of tools for requesting specific kinds of data, and return it in a structured format. These tools allow users direct access to the data generated by web services without going through the online front-end.
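As a small illustration of what "structured format" means here, most APIs return data as JSON, which Python can parse directly into lists and dictionaries. The payload below is a made-up sample (a hypothetical news API response), not the output of any real service:

```python
import json

# Hypothetical JSON text, of the kind an API might return in response to a request.
response_text = '{"articles": [{"title": "Headline A", "date": "2023-01-01"}, {"title": "Headline B", "date": "2023-01-02"}]}'

# json.loads turns the JSON text into ordinary Python objects
data = json.loads(response_text)

# Because the data is structured, fields can be extracted by name
titles = [article["title"] for article in data["articles"]]
print(titles)
```

Compare this with a web page, where the same headlines would be buried inside HTML markup designed for display rather than for programmatic access.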

While official APIs are the easiest way to get structured data from the web, they often have use restrictions that limit the amount of data that can be obtained. Furthermore, many interesting online data sources do not provide an API at all. In these cases, we can use 'web scraping' tools to take information directly from front-end websites and structure it in a usable data format. This can be challenging because different sites present their data in different ways, but it has the potential to give us access to many more sources of information that do not provide a structured API.

One of the most popular tools for web scraping is a Python library called Beautiful Soup. To get some practice with this library, you can follow along with a recorded tutorial on this YouTube channel. You can use this notebook to follow the code examples in the tutorial and keep them for future reuse. You can also consult the full documentation of the Beautiful Soup library, which has many more examples.


In [ ]:
# we start by importing the beautiful soup library
from bs4 import BeautifulSoup
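To see the basic workflow before following the tutorial, here is a minimal sketch: Beautiful Soup takes HTML (here a hard-coded string standing in for a downloaded page) and lets us pull out specific elements by tag. The page content and URLs are invented for illustration:

```python
from bs4 import BeautifulSoup

# A toy HTML page, standing in for the source of a real website
html = """
<html><body>
  <h1 class="title">Example page</h1>
  <ul>
    <li><a href="https://example.com/a">First story</a></li>
    <li><a href="https://example.com/b">Second story</a></li>
  </ul>
</body></html>
"""

# Parse the HTML into a searchable tree using Python's built-in parser
soup = BeautifulSoup(html, "html.parser")

# Extract the text of the first <h1> element
heading = soup.find("h1").get_text()

# Collect the href attribute of every link on the page
links = [a["href"] for a in soup.find_all("a")]
print(heading)
print(links)
```

In practice the `html` string would come from downloading a page (for example with the `requests` library), but the parsing and extraction steps look the same.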